83 research outputs found

    Corpus-Based Techniques for Word Sense Disambiguation

    The need for robust and easily extensible systems for word sense disambiguation, coupled with successes in training systems for a variety of tasks using large on-line corpora, has led to extensive research into corpus-based statistical approaches to this problem. Promising results have been achieved by vector space representations of context, clustering combined with a semantic knowledge base, and decision lists based on collocational relations. We evaluate these techniques with respect to three important criteria: how their definition of context affects their ability to incorporate different types of disambiguating information, how they define similarity among senses, and how easily they can generalize to new senses. The strengths and weaknesses of these systems provide guidance for future systems, which must capture and model a variety of disambiguating information, both syntactic and semantic.

    Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments

    This study presents a corpus of turn changes between speakers in U.S. Supreme Court oral arguments. Each turn change is labeled on a spectrum of "cooperative" to "competitive" by a human annotator with legal experience in the United States. We analyze the relationship between speech features, the nature of exchanges, and the gender and legal role of the speakers. Finally, we demonstrate that the models can be used to predict the label of an exchange with moderate success. The automatic classification of the nature of exchanges indicates that future studies of turn-taking in oral arguments can rely on larger, unlabeled corpora. Comment: To appear in Proceedings of INTERSPEECH 202

    Characterizing and recognizing spoken corrections in human-computer dialog

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 103-106). Miscommunication in human-computer spoken language systems is unavoidable. Recognition failures on the part of the system necessitate frequent correction attempts by the user. Unfortunately and counterintuitively, users' attempts to speak more clearly in the face of recognition errors actually lead to decreased recognition accuracy. The difficulty of correcting these errors, in turn, leads to user frustration and poor assessments of system quality. Most current approaches to identifying corrections rely on detecting violations of task or belief models that are ineffective where such constraints are weak and recognition results inaccurate or unavailable. In contrast, the approach pursued in this thesis uses the acoustic contrasts between original inputs and repeat corrections to identify corrections in a more content- and context-independent fashion. This thesis quantifies and builds upon the observation that suprasegmental features, such as duration, pause, and pitch, play a crucial role in distinguishing corrections from other forms of input to spoken language systems. These features can also be used to identify spoken corrections and explain reductions in recognition accuracy for these utterances. By providing a detailed characterization of acoustic-prosodic changes in corrections relative to original inputs in a voice-only system, this thesis contributes to natural language processing and spoken language understanding. We present a treatment of systematic acoustic variability in speech recognizer input as a source of new information, to interpret the speaker's corrective intent, rather than simply as noise or user error.
    We demonstrate the application of a machine-learning technique, decision trees, for identifying spoken corrections and achieve accuracy rates close to human levels of performance for corrections of misrecognition errors, using acoustic-prosodic information. This process is simple and local and depends neither on perfect transcription of the recognition string nor on complex reasoning based on the full conversation. We further extend the conventional analysis of speaking styles beyond a 'read' versus 'conversational' contrast to extreme clear speech, describing divergence from phonological and durational models for words in this style. by Gina-Anne Levow. Ph.D.

    Rapidly retargetable interactive translingual retrieval

    This paper describes a system for rapidly retargetable interactive translingual retrieval. Basic functionality can be achieved for a new document language in a single day, and further improvements require only a relatively modest additional investment. We applied the techniques first to search Chinese collections using English queries, and have successfully added French, German, and Italian document collections. We achieve this capability through separation of language-dependent and language-independent components and through the application of asymmetric techniques that leverage an extensive English retrieval infrastructure.

    The prosody of negative yeah

    Normally, yeah has positive polarity, but with a change in prosody, it can convey a negative stance (e.g., polite disagreement/rejection). This study examines acoustic-prosodic features of "negative yeah" in a stance-rich corpus of collaborative tasks. Four categories are identified based on degree of agreement/acceptance and distinguished by an interaction between pitch and intensity: while two groups have low, flat pitch, and two have high domed or dipping contours, this division is cross-cut by intensity, again low-flat vs. high domed. These patterns show that fine-grained stance analysis can reveal word-level acoustic patterns that are not apparent in coarser approaches.

    Chinese-English Semantic Resource Construction

    We describe an approach to large-scale construction of a semantic lexicon for Chinese verbs. We leverage three existing resources: a classification of English verbs called EVCA (English Verb Classes and Alternations) [Levin, 1993], a Chinese conceptual database called HowNet [Zhendong, 1988c, Zhendong, 1988b] (http://www.how-net.com), and a large machine-readable dictionary called Optilex. The resulting lexicon is used for determining appropriate word senses in applications such as machine translation and cross-language information retrieval. (Also cross-referenced as UMIACS-TR-2000-27) (Also cross-referenced as LAMP-TR-044)

    Named Entity Recognition for Bacterial Type IV Secretion Systems

    Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.

    Exploiting Multiple Embeddings for Chinese Named Entity Recognition

    Identifying the named entities mentioned in text would enrich many downstream semantic applications. However, due to the predominant usage of colloquial language in microblogs, named entity recognition (NER) in Chinese microblogs experiences significant performance deterioration compared with NER on formal Chinese corpora. In this paper, we propose a simple yet effective neural framework, named ME-CNER, to derive character-level embeddings for NER in Chinese text. A character embedding is derived with rich semantic information harnessed at multiple granularities, ranging from the radical and character levels to the word level. The experimental results demonstrate that the proposed approach achieves a large performance improvement on the Weibo dataset and comparable performance on the MSRA news dataset, with lower computational cost than the existing state-of-the-art alternatives. Comment: accepted at CIKM 201